SITE LINK

KMID : 1137820220430020109

의공학회지
2022 Volume.43 No. 2 p.109 ~ p.115

Comparative Analysis of Vectorization Techniques in Electronic Medical Records Classification

Yoo Sung-Lim

Abstract

Purpose: Medical records classification using vectorization techniques plays an important role in natural language processing. The purpose of this study was to investigate proper vectorization techniques for electronic med- ical records classification.

Material and methods: 403 electronic medical documents were extracted retrospectively and classified using the cosine similarity calculated by Scikit-learn (Python module for machine learning) in Jupyter Notebook. Vectors for medical documents were produced by three different vectorization techniques (TF-IDF, latent sematic analysis and Word2Vec) and the classification precisions for three vectorization techniques were evaluated. The Kruskal-Wallis test was used to determine if there was a significant difference among three vectorization tech- niques.

Results: 403 medical documents were relevant to 41 different diseases and the average number of documents per diagnosis was 9.83 (standard deviation=3.46). The classification precisions for three vectorization techniques were 0.78 (TF-IDF), 0.87 (LSA) and 0.79 (Word2Vec). There was a statistically significant difference among three vec- torization techniques.

Conclusions: The results suggest that removing irrelevant information (LSA) is more efficient vectorization technique than modifying weights of vectorization models (TF-IDF, Word2Vec) for medical documents classification.

KEYWORD

Natural language processing, Medical records classification, Vectorization techniques, Machine learning, Latent semantic analysis

FullTexts / Linksout information

Listed journal information

site infomation

Prohibition of Unauthorized Collection of E-mail Addresses, medric.kyung@gmail.com
N4 301, Chungbuk National University, Chungdae-ro 1, Seowon-Gu, Cheongju, Chungbuk 28644, Korea